Sample size calculation for a NanoString GeoMx spatial transcriptomics experiment to study predictors of fibrosis progression in non-alcoholic fatty liver disease

Sample size calculation for spatial transcriptomics is a novel and understudied research topic. Prior publications focused on powering spatial transcriptomics studies to detect specific cell populations or spatially variable expression patterns on tissue slides. However, power calculations for translational or clinical studies often relate to the difference between patient groups, and this is poorly described in the literature. Here, we present a stepwise process for sample size calculation to identify predictors of fibrosis progression in non-alcoholic fatty liver disease as a case study. We illustrate how to infer study hypothesis from prior bulk RNA-sequencing data, gather input requirements and perform a simulation study to estimate required sample size to evaluate gene expression differences between patients with stable fibrosis and fibrosis progressors with NanoString GeoMx Whole Transcriptome Atlas assay.


Methods
Initial hypothesis generation. First, we performed analysis to identify which spatial expression patterns could serve as the primary and secondary endpoints.
We explored bulk liver RNA-sequencing cohort from Fujiwara et al. (GSE193066 24 ). Fibrosis progressors and fibrosis regressors had at least + 1 and − 1 fibrosis stage in their circa 2 year follow-up biopsies. The cohort included 28 individuals with stable fibrosis stage, 15 fibrosis regressors and 15 fibrosis progressors. We sought to identify an association between a higher-level gene expression pattern and some element of liver microanatomy or histology. Therefore, we did not make any attempt to control false discovery rate at the level of individual genes or pathways in this analysis. We also used the published DESeq2 rlog-transformed data set as-is without reprocessing. However, we performed a technical method comparison using another data set, GSE135251 25 . We checked that results obtained with linear regression on rlog transformed data and on raw counts with DESeq2 26 were overall consistent, especially with respect to log2 fold change estimates, with this sample size in GSE135251 25 where both raw and normalized data were readily available (Fig. S1). Candidate genes were identified with linear regression on rlog data adjusted for baseline fibrosis stage in GSE193066. The genes were required to have opposite direction of log2 fold change in fibrosis progressors and fibrosis regressors compared to stable fibrosis individuals and raw p < 0.05 in both comparisons. The candidate genes identified in the linear regression analysis in GSE193066 were semi-manually grouped into broad "cytoskeleton" (89 candidate genes) and "lipid metabolism" (44 candidate genes) categories using AmiGO 2 27 and PANTHER 28 based on GO 29,30 and Reactome 31 pathway definitions and hierarchical relationships between the ontology terms. We calculated summary scores for the "lipid metabolism" (44 genes) and the "cytoskeleton" (89 genes) gene sets with Kuppe's method 32 . Briefly, the average was first calculated on rlog normalized expression of all genes within the gene set, then genes with poor correlation (< 0.1) to the average expression of the whole gene set were filtered out and the average was updated based on the remaining genes. We verified that the expression of the "lipid metabolism" score and the "cytoskeleton" score differed between fibrosis progressors, stable fibrosis and fibrosis regressors using linear regression adjusted for both baseline fibrosis stage and baseline NAFLD activity score (NAS).
The candidate "lipid metabolism" and "cytoskeleton" marker sets were derived from bulk liver. Next, we traced potential sources of these expression signals within the liver. We extracted markers of cellular and anatomical hallmarks within liver from literature. Specifically, we obtained markers for liver cell type (hepatocytes, cholangiocytes, T cells etc.) from 33 , for hepatocyte zones in NASH liver 34 and the "core matrisome" markers for extracellular matrix from 32 . We calculated summary scores i.e. average normalized expression, for markers of different cell types, zones within liver and extracellular matrix in the Fujiwara et al. cohort (GSE193066 24 ) and in published Visium liver spatial transcriptomics samples from Wu et al. http:// lifeo me. net/ supp/ liver cancer-st/ data. htm 35 , GSM5616008 from GSE185477 36 and GSE192741 37 . The summary scores (expression averages) were calculated from rlog normalized expression in GSE193066 and from SCT-transformed expression in the Visium data sets. Then, we correlated "lipid metabolism" and "cytoskeleton" scores with the expression scores for liver cell types, ECM and hepatocyte zones. We assumed "guilt-by-association": if summary scores correlated both in bulk and in spatial transcriptomics samples (Spearman method), then the expression signals co-localized.
Simulation study. We made synthetic (artificial) Nanostring GeoMx data. Briefly, we learnt distribution of gene expression values within liver empirically from historic Visium samples and created read count matrices that matched the technical properties of NanoString GeoMx.
Non-tumour liver samples adjacent to hepatocellular carcinoma http:// lifeo me. net/ supp/ liver cancer-st/ data. htm 35 had well-defined fibrosis areas and hepatocyte fraction and satisfied the 100 reads/μm 2 recommendation by NanoString. We used these five samples for the simulation study.
We defined fibrotic niche as Visium spots that were in the 85 percentile or above for ECM score and at or below the 15 percentile of the hepatocyte score. Similarly, hepatocyte spots were at or above the 20 percentile for hepatocyte score and at or below the 80 percentile of the ECM score. The remaining spots were enriched in other cell types (e.g., cholangiocytes) or had low cell density. This resulted in 2.6-7.1% fibrotic niche, which was within the NAFLD fibrosis range 38 , and 63-68% hepatocyte area, which corresponded to the lower range of literature estimates for the hepatocyte fraction by liver volume [39][40][41] . In total, we obtained 819 fibrotic niche spots and 12,074 hepatocyte spots across the 5 samples. To verify that our approach to tissue segmentation was reasonable, we conducted differential expression analysis between "bulk" tissue constructed by summing up counts from fibrotic niche and hepatocyte spots per patient. As expected, fibrotic niche was enriched in markers of diverse non-parenchymal cell types (mesenchymal, immune etc.) and hepatocyte fraction was enriched in hepatocyte markers (Tables S1 and S2).
We translated the expression changes in "lipid metabolism" and "cytoskeleton" summary scores from bulk tissue into expression changes within fibrotic niche and hepatocyte fraction. We found two representative markers PON1 and FLNA that had the most similar spatial expression pattern to the two summary scores identified in initial analysis. All subsequent calculations were based on these two genes.
We spiked-in increasingly large fractions of read counts to PON1 in hepatocytes and FLNA in the fibrotic niche. For example, spiked-in fraction r0.1 for FLNA corresponded to FLNA read counts in each fibrotic niche spot in sample i + 0.1*median(FLNA read counts across fibrotic niche spots in sample i), where i = 1…5. We summed up read counts across all spots per sample to obtain "bulk" tissue. We calculated log2 fold change between spiked-in "bulk" and unmodified "bulk" tissue with DESeq2 using the patient as a fixed effect covariate and constructed calibration curves. We found the point on each calibration curve where fold change in "bulk" tissue matched log2 fold change observed in the initial hypothesis-generating analysis in the Fujiwara et al. cohort 24 . This determined the magnitude of spike-in that we used in the simulation study.
We took the unmodified 12,074 hepatocyte spots and 819 fibrotic niche spots across the five Visium samples as our spot "population" in "stable fibrosis" individuals. The same spots but spiked-in with fractional read counts r0.43 for PON1 and r2.9 for FLNA, respectively, were our spot "population" in "fibrosis regressors" and "fibrosis progressors", respectively.
Initial Visium spot diameter of 55 μm contains 8 to 20 cells 35 . Such low amount of input material is not recommended by NanoString because it leads to only high and moderate abundance genes being detected, variable expression estimates and unreliable differential expression based on a technical benchmark study by the manufacturer ("The GeoMx ® Human Whole Transcriptome Atlas for the Digital Spatial Profiler: Design, Performance, and Experimental Guidelines" document from http:// www. nanos tring. com/ GeoMx DSP). Fibrotic niche was the smaller tissue fraction in our experimental plan and it determined the range of diameters that we evaluated. The width of fibrotic niche in the refence data rarely exceeded 2-3 Visium spots (55 μm*2 + 45 μm distance between spots = 155 μm, 55 μm*3 + 2*45 μm = 255 μm). Reference estimates of fibrotic niche size reported in NAFLD were in the < 200 μm range 38,42,43 . Based on this prior data, we considered 165 μm as maximum spot diameter. The amount of input material (cells) is determined by tissue area that is subjected to sequencing. The relationship between spot diameter and area is quadratic, so we aggregated 2 Visium spots to mimic 80 μm diameter, 4 Visium spots to mimic 110 μm, and 9 Visium spots to mimic 165 μm.
In NanoString terminology, a spot is called "region of illumination", or ROI, and corresponds to the small piece of tissue from which barcodes are harvested for sequencing. We use the term "ROI" to describe the synthetic larger-diameter NanoString data and the term "spot" to describe original 55 μm Visium spots.
We drew random subsamples from our unmodified and spiked-in Visium spot "populations" and summed up counts from neighbouring spots to mimic NanoString ROIs with larger diameter. We made 100 synthetic data sets for each condition (endpoint × N patients × N ROIs × ROI diameter). In the original Visium data, correlations between global gene expression profiles in spots within one patient were not stronger on average than between spots from different patients. So, the new synthetic larger ROIs were randomly assigned to new "patient" or a new ROI from an existing "patient". The simulated data sets were restricted to genes measured on the Nanostring GeoMx Human Whole Transcriptome Atlas assay (GPL32201). Genes that were part of the assay and had median expression < 0.05 TPM in all GTEx tissues (https:// gtexp ortal. org/ home/ 44 ) were used as negative control "background probes" for the GeoDiff method 45 .
Since spatial transcriptomics is a new technology and the NanoString software GeoDiff 45 has not been extensively validated yet, we wanted to increase confidence in the results and applied three methods to test differences in expression levels of FLNA and PON1 between spiked-in and unmodified data sets. Conceptually, all three methods relied on negative binomial mixed effect regression models with background or effective library size treated as a threshold or an offset. Therefore, expression of a gene of interest was corrected for background. In all cases, patient group was treated as the fixed effect (e.g., "fibrosis progressor" vs "stable fibrosis") and patient as the random intercept. Effective library size was calculated with edgeR 46 .
Method2, glmer.nb 47  www.nature.com/scientificreports/ Likelihood ratio test (LRT) between full model assuming that there is a group difference and the null model assuming that gene expression is explained by technical and within-patient variability.
Full model: gene count ~ offset(natural logarithm of effective library size) + group + (1|patient). Null model: gene count ~ offset(natural logarithm of effective library size) + (1|patient). We summarized our simulation setup in Table 1. For the final simulation run, the numbers of repeats/random data sets per condition was increased to 1000.

Results
The overall study workflow is summarised in Fig. 1.

Study endpoints.
Sample size calculation requires definition of the study endpoint. We used a combination of historic bulk RNA-sequencing (GSE193066 24 ) and spatial transcriptomics data [35][36][37] to identify spatial gene expression pattens in baseline biopsies that correlated with fibrosis outcome in NAFLD.
First, we identified markers whose expression at baseline could discriminate individuals with fibrosis progression, regression and stable fibrosis in the future. We aimed to identify markers that carried non-redundant information to histological severity of fibrosis, which is a known risk factor for further fibrosis progression. www.nature.com/scientificreports/ Hence, we stratified baseline biopsies in GSE193066 based on fibrosis outcome in the follow-up biopsy of the same patient. We identified differences in baseline gene expression between progressors, regressors and individuals with stable fibrosis using multivariate regression adjusted for baseline fibrosis stage. At baseline, 108 genes had higher expression in fibrosis regressors and lower expression in fibrosis progressors compared to stable fibrosis individuals (Table S3). These genes were enriched in pathway terms related to different aspects of lipid metabolism (Tables S4 and S5). Overall, 44 genes (40.7%) were annotated with at least one GO or Reactome pathway term for lipid metabolism (Table S6). By contrast, 187 genes had higher baseline expression in fibrosis progressors and lower baseline expression in fibrosis regressors (Table S7). These genes were enriched in terms related to actin cytoskeleton, cell projections such as cell leading edge and RHO GTPase activity (Tables S8-S11). Taken together, 89 genes (47.6%) were annotated with at least one GO or Reactome term related to cytoskeleton biology (Table S12). We observed no other biological themes in the enrichment analysis results. We used umbrella terms "lipid metabolism" and "cytoskeleton" to denote these 44 and 89 genes and aggregated their expression into summary scores, i.e., normalized expression averages (Table S13). Analysis of these summary scores confirmed that fibrosis progressors had lower baseline expression of lipid metabolism markers (Fig. 2a) and higher expression of cytoskeleton markers (Fig. 2b) irrespective of their fibrosis severity and histological activity of non-alcoholic steatohepatitis at baseline.   www.nature.com/scientificreports/ Lipid metabolism and cytoskeleton scores were also calculated in Visium data (Tables S14-S16) and correlated with summary scores for liver cell types, zonation and extracellular matrix. Lipid metabolism score correlated with hepatocyte score (Fig. 2c) while cytoskeleton score correlated with extracellular matrix (ECM) (Fig. 2d) and mesenchyme scores (Fig. 2e) in bulk NAFLD livers. Lipid metabolism score co-localized with the hepatocyte fraction in spatial transcriptomics on both normal and diseased livers. By contrast, cytoskeleton score co-localized with fibrosis areas marked by high expression of ECM and mesenchyme scores in diseased livers (Fig. 2f). We could not further deconvolute which cell type of the mesenchymal lineage (stellate cells, activated stellate cells, vascular smooth muscle cells or fibroblasts) could lead to co-localization of this expression signal.
Based on this initial analysis, we formulated two study hypotheses: 1. Fibrosis regressors have higher baseline expression of the lipid metabolism score in hepatocytes than stablefibrosis individuals. 2. Fibrosis progressors have higher baseline expression of the cytoskeleton score in the fibrotic niche than the fibrosis-stable patients To facilitate reading, the study hypotheses were formulated relative to the patient group with higher expression. However, the differences in lipid metabolism and cytoskeleton scores were observed in both directions (i.e., two-sided test, 1.-higher in fibrosis regressors/lower in progressors, 2.-higher in progressors/lower in regressors) and had approximately same magnitude compared to the fibrosis-stable group. The corresponding null hypotheses were that there were no group differences.
Hypothesis 1 was prioritized as the primary endpoint because we considered isolation of fibrotic niche more technically challenging and more likely to fail. However, both hypotheses were biologically plausible [54][55][56] , and sample sizes were calculated to address both hypotheses.
Next, we needed to turn these hypotheses into quantifiable study endpoints. Hepatocytes are the most abundant fraction of liver tissue even in individuals with advanced fibrosis [39][40][41] , so, effect size within the hepatocyte fraction would be expected to be only slightly different from the effect size observed in bulk liver. By contrast, fibrosis areas constitute 2.8% (2.1-3.6%) in F1, 4.3% (3.3-5.4%) in F2, 4.8% (3.7-7.4%) in F3 and 12.3% (8.4-18.5%) in F4 NAFLD fibrosis stages 38 . Thus, expression signal originating within the fibrotic niche may be "diluted" in bulk liver, and we needed to translate the effect sizes between bulk liver and fibrotic niche and hepatocyte fraction. Individual genes that contributed to the summary scores did not always match the spatial expression pattern for the whole score. We found two representative markers PON1 and FLNA that had the most similar spatial expression pattern to the lipid metabolism score and cytoskeleton score, respectively (Figs. S2 and S3, 3) and based the subsequent calculations on these two genes. Based on the calibration curve analysis (Fig. S4), we formulated the primary and secondary endpoints as: 1. Fibrosis regressors have 0.42 ± 0.02 log2 fold higher baseline expression of PON1 in hepatocytes than stable fibrosis individuals. 2. Fibrosis progressors have 1.5 ± 0.11 log2 fold higher baseline expression of FLNA in the fibrotic niche than stable fibrosis individuals.
Initial simulation study. Using the NanoString GeoMx ® Human Whole Transcriptome Atlas Digital Spatial Profiler, gene expression data is typically collected from 6 to 12 regions of illumination (ROIs) per tissue sample (MAN-10108-01_GeoMx_DSP_Experimental_Design_Guideline.pdf from http:// www. nanos tring. com/ GeoMx DSP). Historic data indicated that tissue samples may have uneven thickness and cell density may be variable within tissue sample. So, the expression measurements were expected to have large technical variability in addition to biological variability of expression for PON1 in hepatocytes and FLNA in the fibrotic niche. Given that the outcome of the experiment may be affected by the sampling variability within the tissue sample, we conducted a simulation study to accurately mirror this sampling variability and estimate the required number of patients, ROI diameter and number of ROIs per patient. We applied three mixed effect negative binomial models implemented in three different software packages to synthetic (artificial) NanoString data with varied ROI diameter.
Interestingly, the three methods applied head-to-head on the same synthetic data resulted in distinct power surfaces (Figs. 4 and 5). However, in all cases, power to detect expression differences between the groups increased with increasing number of patients per group but not necessarily with increasing number of ROIs per patient (visible in Figs. 4 and 5 in each column corresponding to a fixed N patients).
Final simulation study. The initial simulation study helped us to narrow down the range of experimental conditions to the largest ROI size 165 μm, two ROIs per patient and at least 4 patients per group. We repeated the simulation study but with 1,000 synthetic data sets per condition ( Table 2). All three methods were run on the same synthetic data, i.e., PON1 and FLNA read counts used to evaluate GeoDiff and glmer.nb and GLMMadaptive LRT methods were exactly the same. GeoDiff overestimated log2 fold change for the primary endpoint and underestimated log2 fold change for the secondary endpoint relative to the actual spike-in. Glmer.nb and GLMMadaptive LRT resulted in smaller log2 fold change estimates in the synthetic NanoString data sets than the actual spike-in but were close for the primary endpoint (PON1 in hepatocyte fraction median log2 FC 0.33 in simulation vs 0.42 in calibration curve analysis). Glmer.nb and GLMMadaptive LRT were faster than GeoDiff (e.g., final simulation for the secondary endpoint GeoDff 16.4 h, glmer.nb 6.6 h and GLMMadaptive LRT 2.4 h). However, we did not optimize the code, so runtimes could be reduced further. Convergence rate was 100% for Taking all these factors into consideration, the final sample size was based on the glmer.nb method. We calculated that we would need (10 patients × 2 hepatocyte ROIs + 5 patients (subset of the ten) × 2 fibrotic niche ROIs) × 3 experimental groups (stable fibrosis, fibrosis regressors, fibrosis progressors) = 90 samples subjected to sequencing at baseline. If paired follow-up biopsies are profiled for exploratory analysis from the same patients and for the same tissue areas, then the total sample size for baseline and equal number of follow-up samples is 180 samples. Sensitivity analysis. The summary scores calculated from normalized data could also be approximated as sum of raw gene counts for all genes constituting the score. For example, Spearman correlation between lipid metabolism score calculated on normalized counts and sum of raw counts for the 44 lipid metabolism genes was median 0.79 (IQR 0.72-0.82) in the spatial transcriptomics samples. Sensitivity analysis using sum of gene counts resulted in similar fold change estimates in calibration curve analysis (Fig. S5) and similar sample size estimates (Table S17). Fourteen and four patients per group were sufficient to reach 80% power for testing group differences in expression of the lipid metabolism and cytoskeleton genes in the hepatocyte fraction and fibrotic niche, respectively.

Discussion
Sample size calculation is a crucial step in planning studies that are appropriately powered to answer the research question. Limited methodological research is available for sample size calculation for spatial transcriptomics in clinical research. In this study, we outline how to gather required inputs (e.g. estimate effect size based on historic data) and perform sample size estimation for comparison between patient groups. Our biological question of interest was the difference in hepatic gene expression levels between patient groups (fibrosis progressors, fibrosis regressors and stable fibrosis patients) assayed with NanoString GeoMx spatial transcriptomics. The sampling variability within the tissue had to be accounted for but was not of primary interest. The three mixed effect model methods produced distinct power surfaces in the initial simulation run despite having similar conceptual basis. We attribute these differences to nuances of the software implementation pertaining to calculation of model likelihood and p-values. Still, with all three methods, power depended on the number of patients per group and not ROIs per patient, which is consistent with theoretical considerations 57 . The power of the study depends on the hypothesis/model parameter being tested (in this case, fixed effect regression coefficient for patient group). The number of ROIs per patient affects the ability to accurately estimate the random effects (in this case, patientspecific intercept terms) and not the fixed effect regression coefficients 57 . This represents sharp contrast to the current literature that focuses primarily on spatial variability of expression within the tissue and, accordingly, increasing the number of spots or ROIs per tissue sample [6][7][8] . Whilst group differences between patients could be addressed with conventional experimental methods such as RNA-sequencing or even qPCR on FACS sorted or laser micro dissected tissue, this presents an impractical barrier in human research when spatial transcriptomics with the NanoString platform enables reliable quantification of gene expression levels from small areas of archived formalin-fixed paraffin-embedded tissue. www.nature.com/scientificreports/ Due to low amount of input material, there is a risk that some ROIs will fail technical quality control after acquisition of sequencing data. Therefore, we started the simulation with 2 ROIs per patient and endpoint. If one ROI fails technical quality control in a subset of individuals, the remaining replicates should still be sufficient to address the study endpoint. For example, if FLNA expression in 165 μm fibrotic niche ROIs was compared between 5 fibrosis progressors and 5 fibrosis-stable individuals (5 × 2 × 2 ROI = 20 ROIs) and a third (6) of the ROIs were removed ("failed quality control") but in a way that did not result in complete loss of the affected patients, then power to detect the group difference with glmer.nb still remained 78.9% based on evaluation of 1000 synthetic data sets.glmer.nb yielded similar log2 fold change estimates to the actual spike-in, had good convergence rate and short runtime, so we decided to test primary and secondary endpoints with this method. We set the alpha level to 0.025 (0.05/2 endpoints). We decided that the primary endpoint will be tested first, the secondary endpoint will be tested next and any other analyses such as transcriptome-wide differential expression will be unpowered exploratory. Standard deviations of random intercepts were generally low: median 0.3 (IQR 0.1-0.5) for FLNA in fibrotic niche and median 0.1 (IQR 0.05-0.2) for PON1 in hepatocytes for all N patients and ROI size 165 µm in the final simulation study. Type I error in this scenario should be close to the theoretical level 48 and we did not implement any additional steps to control it. However, sensitivity analysis could be conducted via bootstrap method 48 . In addition, expression of summary scores for the lipid metabolism and cytoskeleton markers could be compared between fibrosis progressors, fibrosis regressors and stable fibrosis individuals in sensitivity analysis. We considered this an acceptable approach. If transcriptome-wide significance level was desired, alpha could be set to 1.25e−05, which corresponds to (0.05/2 endpoints)/ca. 2000 reliably detected genes with 165 μm ROI in our reference data. At alpha 1.25e−05, 27 patients per group would be required to achieve 80% power for the primary endpoint (PON1 in hepatocytes) and 15 patients per group would be required for the secondary endpoint (FLNA in fibrotic niche).   24 . Therefore, the hypotheses formulated in this cohort might not generalize to all-comer NAFLD population.
Second, the initial study hypotheses (scores) were based on non-overlapping gene sets. We assumed that the two endpoints pertained to different aspects of biology and different tissue fractions and could therefore be considered independent, which may not necessarily be the case.
Third, NAFLD Brunt fibrosis score reflects the relative amount of fibrotic tissue area within the liver. We assumed that if the amount of fibrotic tissue subjected to sequencing (i.e., the ROI size) is fixed, then we do not need to adjust for fibrosis stage any longer. However, this assumption may not hold true if microscopic structure (e.g., cell composition or cell density) differs within fibrotic niche in patients with different fibrosis severity.
Finally, we focused on the sample size calculation and left other experimental considerations outside of the scope of the current study. For example, identification of ROIs (antibody selection, labelling with fluorescent dyes, reagent concentration etc.) and sample randomization (the instrument has throughput of 4-8 slides a day, "GeoMx ® Digital Spatial Profiler (DSP) Project Design Guide" document from http:// www. nanos tring. com/ GeoMx DSP) are separate topics. www.nature.com/scientificreports/ In conclusion, NanoString GeoMx is an emerging spatial transcriptomics technology. While some considerations such as definition of regions of interest and their size are technology-specific, we believe that the sample size calculations for spatial transcriptomics, as for any other technology, are predominantly shaped by the biological or clinical study hypothesis. The study hypothesis helps to define the analysis method (statistical model), the associated assumptions and the expected effect size or minimal clinically meaningful difference. In the absence of knowledge-based hypothesis or a dedicated pilot spatial transcriptomics study, the study endpoints can be inferred from prior bulk RNA-sequencing data and re-scaled to tissue fractions corresponding to ROIs in spatial transcriptomics as illustrated in our case study.

Data availability
Bulk RNA-sequencing GSE193066 and spatial transcriptomics Wu